27 research outputs found

    Future of networking is the future of Big Data, The

    2019 Summer. Includes bibliographical references. Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows where each of these communities generates, stores, and uses massive datasets that reach into terabytes and petabytes and are projected to reach exabytes soon. These communities are also increasingly moving towards a global collaborative model where scientists routinely exchange a significant amount of data. The sheer volume of data and the complexities of maintaining, transferring, and using it continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model in which the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network. This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the designs of in-network protocols for big-data science.

    Supporting Climate Research using Named Data Networking

    Climate and other big data applications face substantial problems in terms of data storage, retrieval, sharing, and management. While several community repositories and tools are available to help with climate data, these problems persist and the community is actively looking for better solutions. In this project we apply NDN to support climate modeling applications. The information-centric nature of NDN, where content becomes a first-class entity, simplifies many of the problems in this domain. NDN offers lightweight data publication, discovery, and retrieval compared to IP-based solutions. However, introducing a new network architecture to a mature domain that routinely produces petabytes of datasets and has a plethora of assorted tools to manipulate them is a risky proposition. The advantages of NDN alone may not be sufficient to overcome the natural inertia. Our approach is to introduce NDN while carefully avoiding undue disruption to existing workflows. To that end, we employ a user interface that uses familiar filesystem operations to publish, discover, and retrieve data, integrated with domain-specific translators that automatically convert and publish datasets as NDN objects. We outline the advantages of NDN in this application domain and the challenges we faced during the adaptation. We believe this is the first exercise in applying NDN in an existing, large, mature application domain.

    Managing scientific data with named data networking

    Many scientific domains, such as climate science and High Energy Physics (HEP), have data management requirements that are not well supported by the IP network architecture. Named Data Networking (NDN) is a new network architecture whose service model is better aligned with the needs of data-oriented applications. NDN provides features such as best-location retrieval, caching, load sharing, and transparent failover that would otherwise be painstakingly (re-)implemented by each application using point-to-point semantics in an IP network. We present the first scientific data management application designed and implemented on top of NDN. We use this application to manage climate and HEP data over a dedicated, high-performance testbed. Our application has two main components: a UI for dataset discovery queries and a federation of synchronized name catalogs. We show how NDN primitives can be used to implement common data management operations such as publishing, search, efficient retrieval, and publication access control.

    Named Data Networking based File Access for XRootD

    We present the design and implementation of a Named Data Networking (NDN) based Open Storage System plug-in for XRootD. This is an important step towards integrating NDN, a leading future internet architecture, with the existing data management systems in CMS. This work outlines the first results of data transfer tests using internal as well as external 100 Gbps testbeds, and compares the NDN-based implementation with existing solutions.

    Hydra -- A Federated Data Repository over NDN

    Today's big data science communities manage their data publication and replication at the application layer. These communities utilize myriad mechanisms to publish, discover, and retrieve datasets; the result is an ecosystem made up of either centralized repositories or collections of ad-hoc data repositories. Publishing datasets to centralized repositories can be process-intensive, and those repositories do not accept all datasets. The ad-hoc repositories are difficult to find and utilize due to differences in data names, metadata standards, and access methods. To address the problem of scientific data publication and storage, we have designed Hydra, a secure, distributed, and decentralized data repository made of a loose federation of storage servers (nodes) provided by user communities. Hydra runs over Named Data Networking (NDN) and utilizes the State Vector Sync (SVS) protocol that lets individual nodes maintain a "global view" of the system. Hydra provides a scalable and resilient data retrieval service. Data distribution scales via NDN's built-in data anycast and in-network caching, and resiliency against individual server failures comes from automated failure detection and from maintaining a specific degree of replication. Hydra utilizes "Favor", a locally calculated numerical value, to decide which nodes will replicate a file. Finally, Hydra utilizes data-centric security for data publication and node authentication. Hydra uses a Network Operation Center (NOC) to bootstrap trust in Hydra nodes and data publishers. The NOC distributes user and node certificates and performs the proof-of-possession challenges. This technical report serves as the reference for Hydra. It outlines the design decisions, the rationale behind them, the functional modules, and the protocol specifications.

    Named Data Networking in Climate Research and HEP Applications

    The Computing Models of the LHC experiments continue to evolve from the simple hierarchical MONARC[2] model towards more agile models where data is exchanged among many Tier2 and Tier3 sites, relying on both large scale file transfers with strategic data placement, and an increased use of remote access to object collections with caching through CMS's AAA, ATLAS' FAX and ALICE's AliEn projects, for example. The challenges presented by expanding needs for CPU, storage, and network capacity, as well as rapid handling of large datasets of file and object collections, have pointed the way towards more agile, pervasive future models that make best use of highly distributed heterogeneous resources. In this paper, we explore the use of Named Data Networking (NDN), a new Internet architecture focusing on content rather than the location of the data collections. As NDN has shown considerable promise in another data intensive field, Climate Science, we discuss the similarities and differences between the Climate and HEP use cases, along with specific issues HEP faces and will face during LHC Run2 and beyond, which NDN could address.

    Request aggregation, caching, and forwarding strategies for improving large climate data distribution with NDN: A case study

    Scientific domains such as Climate Science, High Energy Particle Physics (HEP), and others routinely generate and manage petabytes of data, projected to rise into exabytes [26]. The sheer volume and long life of the data stress IP networking and traditional content distribution network mechanisms. Thus, each scientific domain typically designs, develops, implements, deploys, and maintains its own data management and distribution system, often duplicating functionality. Supporting various incarnations of similar software is wasteful, prone to bugs, and results in an ecosystem of one-off solutions. In this paper, we present the first trace-driven study that investigates NDN in the context of a scientific application domain. Our contribution is threefold. First, we analyze a three-year climate data server log and characterize data access patterns to expose important variables such as cache size. Second, using an approximated topology derived from the log, we replay log requests in real time over an NDN simulator to evaluate how NDN improves traffic flows through aggregation and caching. Finally, we implement a simple, nearest-replica NDN forwarding strategy and evaluate how NDN can improve scientific content delivery.

    Scari: A strategic caching and reservation protocol for ICN

    Point-to-point resource reservation solutions over IP networks are often end-to-end, and data flowing through these reserved tunnels are not reusable. As a result, the in-network resources are not optimally utilized. Information Centric Networking (ICN) has several properties that can more intelligently facilitate resource reservations. In this paper, we present Strategic Caching And Reservation in ICN (SCARI) for reserving resources on ICN networks. Preliminary simulation results indicate that SCARI can reduce bandwidth consumption and free up network resources by aggregating reservation requests and strategically caching content in the network.